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This paper considers estimating a covariance matrix of p vari- 
ables from n observations by either banding or tapering the sample 
covariance matrix, or estimating a banded version of the inverse of 
the covariance. We show that these estimates are consistent in the 
operator norm as long as {\ogp)/n — > 0, and obtain explicit rates. The 
results are uniform over some fairly natural well-conditioned families 
of covariance matrices. We also introduce an analogue of the Gaus- 
sian white noise model and show that if the population covariance 
is embeddable in that model and well-conditioned, then the banded 
approximations produce consistent estimates of the eigenvalues and 
associated eigenvectors of the covariance matrix. The results can be 
extended to smooth versions of banding and to non- Gaussian distri- 
butions with sufflciently short tails. A resampling approach is pro- 
posed for choosing the banding parameter in practice. This approach 
is illustrated numerically on both simulated and real data. 

1. Introduction. Estimation of population covariance matrices from sam- 
ples of multivariate data has always been important for a number of reasons. 
Principal among these are (1) estimation of principal components and eigen- 
values in order to get an interpretable low-dimensional data representation 
(principal component analysis, or PC A); (2) construction of linear discrim- 
inant functions for classification of Gaussian data (linear discriminant anal- 
ysis, or LDA); (3) establishing independence and conditional independence 
relations between components using exploratory data analysis and testing; 
and (4) setting confidence intervals on linear functions of the means of the 
components. Note that (1) requires estimation of the eigenstructure of the 
covariance matrix while (2) and (3) require estimation of the inverse. 
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The theory of multivariate analysis for normal variables has been well 
worked out — see, for example [1]. However, it became apparent that ex- 
act expressions were cumbersome, and that multivariate data were rarely 
Gaussian. The remedy was asymptotic theory for large samples and fixed 
relatively small dimensions. In recent years, datasets that do not fit into this 
framework have become very common — the data are very high-dimensional 
and sample sizes can be very small relative to dimension. Examples include 
gene expression arrays, fMRI data, spectroscopic imaging, numerical weather 
forecasting, and many others. 

It has long been known that the empirical covariance matrix for samples 
of size n from a p-variate Gaussian distribution, A/'p(/x, Ep), has unexpected 
features if both p and n are large. If p/n ^ c G (0, 1) and the covariance ma- 
trix T,p = I (the identity) , then the empirical distribution of the eigenvalues 
of the sample covariance matrix Sp follows the Marcenko-Pastur law [26], 
which is supported on ((1 — y^)^, (1 + \/c)^)- Thus, the larger p/n, the more 
spread out the eigenvalues. Further contributions to the theory of extremal 
eigenvalues of Sp have been made in [3, 14, 32], among others. In recent 
years, there have been great developments by Johnstone and his students 
in the theory of the largest eigenvalues [21, 28] and associated eigenvectors 
[22]. The implications of these results for inference, other than indicating 
the weak points of the sample covariance matrix, are not clear. 

Regularizing large empirical covariance matrices has already been pro- 
posed in some statistical applications — for example, as original motivation 
for ridge regression [17] and in regularized discriminant analysis [12]. How- 
ever, only recently has there been an upsurge of both practical and theoreti- 
cal analyses of such procedures — see [10, 13, 18, 25, 33] among others. These 
authors study different ways of regularization. Ledoit and Wolf [25] consider 
Steinian shrinkage toward the identity. Furrer and Bengtsson [13] consider 
"tapering" the sample covariance matrix, that is, gradually shrinking the off- 
diagonal elements toward zero. Wu and Pourahmadi [33] use the Cholesky 
decomposition of the covariance matrix to perform what we shall call "band- 
ing the inverse covariance matrix" below, and Huang et al. [18] impose Li 
penalties on the Cholesky factor to achieve extra parsimony. Other uses of 
Li penalty include applying it directly to the entries of the covariance ma- 
trix [2] and applying it to loadings in the context of PCA to achieve sparse 
representation [34]. Johnstone and Lu [22] consider a different regularization 
of PCA, which involves moving to a sparse basis and thresholding. Fan, Fan 
and Lv [10] impose sparsity on the covariance via a factor model. 

Implicitly these approaches postulate different notions of sparsity. Wu 
and Pourahmadi's interest focuses, as does ours, on situations where we can 
expect that |^ — j| large implies near independence or conditional (given the 
intervening indexes) independence of Xi and Xj . At the very least our solu- 
tions are appropriate for applications such as climatology and spectroscopy, 
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where there is a natural metric on the index set. Huang et al.'s regulariza- 
tion is more flexible but also depends on the order of variables. In other 
contexts, notably in finance applications, the sparsity implicit in a factor 
model of Fan et al. is more suitable. Johnstone and Lu's method relies on 
the principal eigenvectors being sparse in some basis. 

The asymptotic frameworks and convergence results, if at all considered, 
vary among these studies. Wu and Pourahmadi [33] consider convergence in 
the sense of single matrix element estimates being close to their population 
values in probability, with p„ — > oo at a certain rate determined by the spline 
smoothers they used. Ledoid and Wolf [25] show convergence of their esti- 
mator in "normalized" Frobenius norm \ip/n is bounded, whereas 
Furrer and Bengtsoon [13] use the Frobenius norm itself, ||^|||' = tT{AA^), 
which we shall argue below is too big. Fan, Fan and Lv [10] also show that 
the Frobenius norm is too big in the factor model context, and employ a 
different norm based on a sequence of covariance matrices, which is closely 
related to the entropy loss [20]. Johnstone and Lu [22] show convergence of 
the first principal component of their estimator when p/n^ const. 

We have previously studied [5] the behavior of Fisher's discriminant func- 
tion for classification as opposed to the so-called "naive Bayes" procedure 
which is constructed under the assumption of independence of the compo- 
nents. We showed that the latter rule continues to give reasonable results 
for well-conditioned 5]„ as long as — > while Fisher's rule becomes 
worthless if p/n^ oo. We also showed that using /c-diagonal estimators of 
the covariance achieves asymptotically optimal classification errors if Sp is 
Toeplitz and A;„ — > oo at a certain rate. However, the performance of the 
banded estimators was only evaluated in the context of LDA. 

In this paper we show how, by either banding the sample covariance 
matrix or estimating a banded version of the inverse population covariance 
matrix, we can obtain estimates which are consistent at various rates in the 
operator norm as long as and Sp ranges over some fairly natural 

families. This implies that maximal and minimal eigenvalues of our estimates 
and Sp are close. We introduce the banding approach for the covariance 
matrix and for the Cholesky factor of the inverse in Section 2. In Section 3 we 
give the main results: description of classes of covariance matrices for which 
banding makes sense (Section 3.1), convergence and rates results for the 
banded covariance estimator (Section 3.2), which we generalize to smooth 
tapering (Section 3.3) and extend to banding the inverse via its Cholesky 
factor (Section 3.4). In Section 4 we introduce an analogue of the Gaussian 
white noise model for covariance matrices and show that if our matrices 
are embeddable in that model and are well conditioned, then our banded 
approximations are such that the eigenstructures (individual eigenvalues and 
associated eigenvectors) of the estimate and population covariance are close. 
Another approximation result not dependent on existence of the limit model 
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is presented as well. In Section 5 we describe a resampling scheme that can 
be used to choose the banding parameter k in practice. In Section 6 we give 
some numerical results, from both simulations and real data. Both theory 
and simulations indicate that the optimal k depends on p, n, and the amount 
of dependence in the underlying model. Section 7 concludes with discussion, 
and the Appendix contains all the proofs. 

2. The model and two types of regularized covariance estimates. We 

assume throughout that we observe Xi, . . . , X.„, i.i.d. p-variate random vari- 
ables with mean and covariance matrix Sp, and write 

Xj = {Xii, . . . ,Xip)'^. 

For now we will assume that the Xj are multivariate normal, and shall argue 
separately that it suffices for Xy to have subexponential tails for all j (see 
Extension I after Theorem 1). We want to study the behavior of estimates 
of Tip as both p and n — > oo. It is well known that the usual MLE of Sp, the 
sample covariance matrix, 

(1) Sp = -^(X,-X)(X,-Xf 

1=1 

behaves optimally as one might expect if p is fixed, converging to Sp at rate 
n~^^'^. However, as discussed in the Introduction, if p— > oo, Sp can behave 
very badly unless it is "regularized" in some fashion. Here we consider two 
methods of regularization. 

2.1. Banding the sample covariance matrix. For any matrix M = [mij]pxp, 
and any <k <p, define, 

Bk{M) = [mijl{\i-j\<k)] 

and estimate the covariance by S^^p = = i?fc(Sp). This kind of regulariza- 
tion is ideal in the situation where the indexes have been arranged in such 
a way that in Ep = [cjjj] 

\i — j\ > k^ Uij = 0. 

This assumption holds, for example, if Sp is the covariance matrix of Yi , . . . , 1^, 
where Yi, l2;--- is a finite inhomogeneous moving average process, Yt = 
X]j=i o.t,t-j£j, and Ej are i.i.d. mean 0. However, banding an arbitrary covari- 
ance matrix does not guarantee positive definiteness — see a generalization 
to general tapering which avoids this problem in Section 3.3. 
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2.2. Banding the inverse. This method is based on the Cholesky de- 
composition of the inverse which forms the basis of the estimators pro- 
posed by Wu and Pom'ahmadi [33] and Huang et al. [18]. Suppose we have 
X = {Xi, . . . , Xp)"^ defined on a probability space, with probability measure 
P, which is Mp{0, Hp), Hp = [cij]. Let 



(2) Xj = J2(^jtXt = Z 



t=i 

be the L2{P) projection of Xj on the linear span of with 
Zj = (Xi, . . . , Xj-i)^ the vector of coordinates up to j — 1, and aj = (aji, . . . , 
ajj-i)'^ the coefficients. If j = 1, let Xi = 0. Each vector aJ can be computed 
as 

(3) aj = (Var(Zj))"^ Cov{Xj,Zj). 

Let the lower triangular matrix A with zeros on the diagonal contain the 
coefficients aj arranged in rows. Let ej = Xj — Xj., (i| = Var(ej) and let 
D = diag((if , . . . , dp) be a diagonal matrix. The geometry of L2{P) or stan- 
dard regression theory imply independence of the residuals. Applying the 
covariance operator to the identity e = {I — A)X gives the modified Cholesky 
decompositions of Sp and Sp^: 

Sp = (/-^)-^D[(/-A)-T, 

(4) 

^~' = {I-AfD~\l-A). 

Suppose now that k <p.lt is natural to define an approximation to Sp by 

restricting the variables in regression (2) to Z^-^^ = (Xn^ax(j-fc,i)) • • • i 

that is, regressing each Xj on its closest k predecessors only. Replacing 

Zj by Tij^'^ in (3) gives the new coefficients ^j^^ ■ Let be the /c-banded 

lower triangular matrix containing the new vectors of coefficients ^ and 
let -Dfc = diag((ij ^) be the diagonal matrix containing the corresponding 
residual variances. Population A;-banded approximations S^^p and are 
obtained by plugging in A^ and D^. in (4) for A and D. 

Given a sample Xi, . . . ,X„, the natural estimates of A]f_ and are ob- 
tained by performing the operations needed under P, the empirical distri- 
bution, that is, plugging in the ordinary least squares estimates of the coef- 
ficients in Ak and the corresponding residual variances in D]^. In the general 
case the variables must be centered first. We will refer to these sample esti- 
mates as Ak = [Ojt^], and Dk = diag((ij ^). Plugging them into (4) for A and 
D gives the final estimates of and Sp via the Cholesky decomposition, 
which we will refer to as and Sfc,p, respectively. 
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Note that since is a fc-banded lower triangular matrix, is /c-banded 
nonnegative definite. Its inverse is in general not banded, and is different 
from Sfc. Similarly, E^"^ is not the same as Bf;{T,~^), which is in any case 
ill-defined when p> n. 

3. Main results. All our results can be made uniform on sets of co- 
variance matrices which we define in Section 3.1. Banding the covariance 
matrix is analyzed in Section 3.2 and generalized to tapering in Section 3.3; 
results on banding the inverse via the Cholesky decomposition are given in 
Section 3.4. All the results show convergence of estimators in the matrix 
L2 norm, ||M|| = sup{||Mx|| : ||x|| = 1} = Xllax{M^M), which for symmetric 
matrices reduces to ||M|| = maxj |Aj(M)|. 

3.1. Classes of covariance matrices. All our sets will be subsets of the 
set which we shall refer to as well- conditioned covariance matrices, T,p, such 
that, for all p, 

0<e< Amin(Sp) < Amax(5]p) < l/s < OO. 

Here, Ainax(Sp), Amin(Sp) are the maximum and minimum eigenvalues of 
Tip, and e is independent of p. 

As noted in Bickel and Levina [5], examples of such matrices include 
covariance matrices of {Ui, . . . , Up)'^ where {Ui,i > 1} is a stationary ergodic 
process with spectral density /, 0<e</<^ and, more generally, of Xi = 
Ui + Vi, i = 1, . . . , where {Ui\ is a stationary process as above and {V^} is a 
noise process independent of {f/j}. This model includes the "spiked model" 
of Paul [28] since a matrix of bounded rank is Hilbert-Schmidt. 

We define the first class of positive definite symmetric well-conditioned 
matrices S = [cjij] as follows: 

U{£Q,a,C) = < S:max^{|(Tij| : \i - j\ > k} < Ck~°' for all A; > 0, 

and < eo < Amin(S) < Amax(S) < I/eq >. 

Contained in hi for suitable 0, C is the class 

C{eo, m, C) = {S : aij = a{i — j) (Toeplitz) with spectral density /s, 

0<eo<||/E||oo<eoMl/s™^lloc<C}, 

where denotes the mth derivative of /. By Grenander and Szego [16], 
if S is symmetric, Toeplitz, S = [a{i — j)], with cr(— /c) = cr(k), and S has an 
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absolutely continuous spectral distribution with Radon-Nikodym derivative 
fj:{u), which is continuous on (—1,1), then 

\\m =sup|/E(n)|, 

(6) 



mf|/E(n)| 



Since H/^^^Hoo < C implies that 

(7) \a{t)\<Ct-"' 

which in turn implies J2t>k l*^!*)! <C{m— l)~^/c~"^+^, we conclude from 6 
and 7 that 

(8) £(eo,m,C) cZ^(eo,m- 1,C). 

A second uniformity class of nonstationary covariance matrices is defined by 

/C(m, C) = {S : an < Cr"", all i}. 

The bound C independent of dimension identifies any limit as being of "trace 
class" as operator for m> 1. 

Although fC is not a well-conditioned class, 

(9) T {eo, mi,m2,Ci,C2) = C{eo,mi,Ci) © /C(m2, C2) CU{e, a, C), 

wherea = min{mi-l,m2/2-l}, C < (Ci /(mi - 1) + (72/(^2/2 - 1), < 
Eq^ + C2- To check claim (9), note that 

eo < KiniL) < Xmin{L + K) < Amax(i + K) 

< ||L|| + ||K|| <e^^ + C2 



and 



max V \KiA< max V \Kii\^/^\KjA^/'^ 

i-\i—j\>k i:\i~j\>k 



< 02(7712/2 -iy^k-"'^/^+\ 



p 

max > IK ..n'-iK ..:--< |fr..|V2 

i:\i-j\>k i=k+2 



< (^2(^2/2 -l)(A: + 2)"'"2/2+i. 



We will use the T and C classes for Sp and ^ for convenience. 
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3.2. Banding the covariance matrix. Our first result establishes rates of 
convergence for the banded covariance estimator. 

Theorem 1. Suppose that X is Gaussian andL{{eQ,a,C) is the class of 
covariance matrices defined above. Then, i/ A;.„ x (n~^ logp)"-'^/^^^"^^)), 

(10) i,t,,„-.,Ho.((!f;)""^'°^->iis-,-E-ii 

uniformly on T, gU. 

The proof can be found in the Appendix. Note that the optimal A;„ in 
general depends not only on p and n, but also on the dependence structure 
of the model, expressed by a. An approach to choosing k in practice is 
discussed in Section 5. 

From Theorem 1, we immediately obtain: 

Corollary 1. If a = min{?7T,i - 1, ^ - 1}, mi > 1, m2 > 2, then (10) 

holds uniformly for £T{eQ,ml,m2,Cl,C2)■ 
Extensions of Theorem 1. I. The Gaussian assumption may be replaced 

by the following. Suppose Xj = {Xn, . . . ,Xip)'^ are i.i.d., Xij ~ Fj, where 

Fj is the c.d.f. of Xy, and Gj{t) = Fj{Vi) - Fj{-y/t) is the c.d.f. of Xjj. 

Then for Theorem 1 to hold it suffices to assume that 

roo 

(11) max / exp(At) dGj{t) < oo for < |A| < Aq 

i<i<p JO 

for some Aq > 0. This follows by using the argument of Lemma A. 3 and 
verifying condition (P) on page 45 of [29]. 

II. If we only assume E\Xij\l^ ^G, /3 > 2, for all j, we can replace (A4) 
in the proof of Theorem 1 by 

(12) P[\\Bk{±^) - Bk{^p)\\^ >t]< Gn-I"\2k + l)pt-P'\ 

Then a few appropriate modifications of the proof (details omitted here) 
imply that if kn x (n-^/V^'^)"'^^"^ where 7(a) = (1 + a + 2//3)-\ then, 

(13) pfcjS) - Spll = Op((n-VV//3)-7(a)). 
The rate of kn is still asymptotically optimal. 

Remarks on convergence rates. (1) Theorem 1 implies that \\Bk (S) — 
^ if l^f^ ^ 0, uniformly on U. It is not hard to see that if = 5 + K 
where S is Toeplitz, Eq < fs 1^ and K is trace class in the sense of 
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Section 4, T,iK{i,i) < oo, then, if — > 0, there exist kn ] oo such that, for 
the given {Sp}, 



(14) - Spll + m^it)]-^ - ^0 

(2) The same claim can be made under (11). On the other hand, under 
ily the moment bound of Ext( 
only conclude that (14) holds if 



only the moment bound of Extension II with Ee = oo, A > we may 



(15) ^-0. 



n 



Related results of Furrer and Bengtsson [13] necessarily have rates of the 
type (15) not because of tail conditions on the variables, but because they 
consider the Frobenius norm. 



3.3. General tapering of the covariance matrix. One problem with simple 
banding of the covariance matrix is the lack of assured positive definiteness. 
However, Furrer and Bengtsson [13] have pointed out that positive definite- 
ness can be preserved by "tapering" the covariance matrix, that is, replacing 
Sp with Ep * R, where * denotes Schur (coordinate-wise) matrix multiplica- 
tion, and R = [rij] is a positive definite symmetric matrix, since the Schur 
product of positive definite matrices is also positive definite. This fact was 
proved by Schur [30] and is also easily seen via a probabilistic interpreta- 
tion: if X and Y are independent, mean random vectors with Cov(X) = A, 
Cov(Y) = B, then Cov(X *Y) = A*B. 

In the general case, let A be a countable set of labels of cardinality |^|. 
We can think of a matrix as [mablaeA. beA- Let p : A x p{a,a) = 

for all a, be a function we can think of as distance of the point (a, b) from 
the diagonal. As an example think of a and b as identified with points in 
i?™ and p{a, b) = \a — b\ where | • | is a norm on R"^. 

Now suppose R = [rab]a,beA is symmetric positive definite with rab = 
g{p{a,b)), g:R^ R^ . Suppose further that g{Q) = 1 and g is decreas- 
ing to 0. Then R* M is a regularization of M. Note that g{t) = l{t < k), 
p{hj) = K ~ j\ gives banding (which is not nonnegative definite). 

In general, let i?^ = [''aicLjb)], where 

r>,6)=,(^), a>0. 

Assumption A. g is continuous, g{0) = 1, g is nonincreasing, ^(cxo) = 0. 
Examples of such positive definite symmetric Ra include 

r^{i,j) = (l-\l^) and r^{i,j) = e~\'~^\/'' . 
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With this notation define 

R4M) = [mabraia,b)] 
with RoiM) = M. Clearly, as cr ^ oo, R„{M) M. 

Our generalization is the following. Denote the range of ga{p{a,b)) by 
{5o-(/0i)> ■ • ■ 75cr(pL)} where {0 < pi < • • • < pi} is the range of p{a,b), a€A, 
b € A. Note that L depends on \ A\ = p. 

Theorem 2. Let A(cj^) = J2h=i9o-{pi)- Note that A depends on \A\ =p 
and the range of p. Suppose Assumption A holds. Then if 

Ax(n-ilogp)-V2(a+i)^ 

the conclusion of Theorem 1 holds for Ra-{T,). 

The proof of Theorem 2 closely follows the proof of Theorem 1 with (A3) 
replaced by Lemma A.l in the Appendix. Both the result and the lemma are 
of independent interest. The remarks after Theorem 1 generalize equally. 

3.4. Banding the Cholesky factor of the inverse. Theorems 1 and 2 give 
the scope of what can be accomplished by banding the sample covariance 
matrix. Here we show that "banding the inverse" yields very similar results. 

If = r(S)^L>-i(E)T(E) with T(S) lower triangular, r(S) = [tij(S)], 

let 

U-\eo,C, a) = |S : < £0 < Xmin{^) < Amax(S) < S 

max ^ < Cfc~" for allk<p-l 

j<i—k 

Theorems. Uniformly for eU~^{eo,C, a), i//c„ x (n~-^ logp)~^/^("+-^) 
and logp = o(l) , 

\\'^klp-^P^\\ = Opi^i^^^ ) = ||Sfc„,p-5]p||. 

The proof is given in the Appendix. Note that the condition n^^logp = 
o(l) is needed solely for the purpose of omitting a cumbersome and unin- 
formative term from the rate (see Lemma A. 2 in the Appendix for details). 

It is a priori not clear what S SZ^~^ means in terms of S. The following 
corollary to Theorem 3 gives a partial answer. 
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Corollary 2. For m>2, uniformly on C{eo,m,C), if kn >^ {n ^ x 
logp)-i/2™, 



n 



The proof of Corollary 2 is given in the Appendix. The reason that the 
argument of Theorem 1 cannot be invoked simply for Theorem 3 is that, as 
we noted before, is not the same as which is not well defined 

if p > n. 

4. An analogue of the Gaussian white noise model and eigenstructure ap- 
proximations. In estimation of the means /x^ of p- vectors of i.i.d. variables, 
the Gaussian white noise model [9] is the appropriate infinite-dimensional 
model into which all objects of interest are embedded. In estimation of ma- 
trices, a natural analogue is the space B{l2,l2), which we write as B, of 
bounded linear operators from I2 to l2- These can be represented as matri- 
ces ['mij]i>i,j>i such that rriijXj]'^ < 00 for all x = {xi,X2, . . .) G h- It 
is well known (see Bottcher [7], e.g.) that if M is such an operator, then 

\\Mf = sup{(Mx,Mx) : |x| = 1} = sup5(M*M), 

where M*M is a self-adjoint member of B with nonnegative spectrum S. 
Recall that the spectrum S{A) of a self-adjoint operator is TZ'^{A), where 
TZ{A) = {X€ R:A — A J G B} where J is the identity. To familiarize ourselves 
with this space we cite some properties of Ti £ B where 

(16) S = [Cov(X(i),X(i))]i,,->i 

is the matrix of covariances of a Gaussian stochastic process {X{t):t = 
1,2,...}: 

1. It is easy to see that the operators S for all ergodic AR processes, 
X{t) = pX{t - 1) + e{t) where e{t) are i.i.d. J\f{0, 1) and \p\ < 1, are in B, 
and S"^ G B. This is, in fact, true of all ergodic ARMA processes. On the 
other hand, X{t) = X]j=i^(j) is evidently not a member of B. 

2. The property T, £ B, G B which we shall refer to as being well con- 
ditioned, has strong implications. By a theorem of Kolmogorov and Rozanov 
(see [19]), if S is Toeplitz, this property holds iff the corresponding station- 
ary Gaussian process is strongly mixing. 

We now consider sequences of covariance matrices Sp such that Sp is the 
upper pxp matrix of the operator S G That is, S is the covariance matrix 
oi{X{t): t = l,2,...} and Ep that of (X(l), . . . , A:(p)). 

By Bottcher [7], if S is well conditioned, then 

T.p{x) S(x) 
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as p ^ oo for all x G 12- We now combine Theorem 6.1, page 120 and The- 
orem 5.1, page 474 of Kato [24] to indicate in what ways the spectra and 
eigenstructures (spectral measures) of Bk„{T,p) are close to those of Sp. 

Suppose that the conditions of the remark following Theorem 1 hold. 
That is, Sp corresponds to T, = S + K , where S £ B is a Toeplitz operator 
with spectral density fs such that, < Eq < fs < and K is trace class, 
J2u -^(^5 u) < oo which implies K G B. 

Let M be a symmetric matrix and O be an open set containing S{M) = 
{Ai,...,Ap} where Ai(M) > A2(M) > ••• > Xp{M) are the ordered eigen- 
values of M and let E{M){-) be the spectral measure of M which assigns 
to each eigenvalue the projection operator corresponding to its eigenspace. 
Abusing notation, let Ep = E{T,p), Ep = p), S = 5(Sp). Then, EpiO) = 
Ep{S) = J, the identity. 

Theorem 4. Under the above conditions on Sp, 
(17) \Ep{0){x)-x\^Q 

for all X £ l2- Further, if I is any interval whose endpoints do not belong to 
S, then 

\Ep{I nS){x) - Ep{I){x)\ -^0. 

Similar remarks apply to S^ p. This result gives no information about 
rates. It can be refined (Theorem 5.2, page 475 of Kato [24]) but still yields 
very coarse information. One basic problem is that S typically has at least 
in part continuous spectrum and another is that the errors involve the ir- 
relevant bias |(Sp — T,){x)\. Here is a more appropriate formulation whose 
consequences for principal component analysis are clear. Let 

^(e,a,C, A,m) 

(18) 

= {Sp G U{e, a, C) : A, (Sp) - A,_i(5]p) > A, 1 < j < m}. 

Thus the top m eigenvalues are consecutively separated by at least A and 
all eigenvalues A^ with j >m + l are separated from the top m by at least 
A. Furthermore, the dimension of the sum of the eigenspaces of the top m 
eigenvalues is bounded by I independent of n and p. We can then state 

Theorem 5. Uniformly on Q as above, for k as in Theorem 1, X Gaus- 
sian, 

logp\^/^/ a 



I Xj (Sfc,p) - A,- (Sp) I = Op ( ( ^ 1 ( log n + - logp 

(19) \^ ^ 

= \\Ej{^k,p) - Ej{^p)\\ for 1 < J < m. 
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That is, the top m eigenvalues and principal components of Sp, if the 
eigenvalues are all simple, are well approximated by those of S^. p. If we 
make an additional assumption on Sp, 

we can further conclude that the top m principal components of S^^p capture 
100(1 — 6)% of the variance of X. To verify (20) we need that 

(21) ''^f%^'^ =op{l). 

tr(Sp) 

This holds if, for instance, tr(Sp) =^}p{p) which is certainly the case for all 
Sp £ T. Then, Theorem 5 follows from Theorem 6.1, page 120 of Kato [24], 
for instance. For simplicity, we give a self-contained proof in the Appendix. 

5. Choice of the banding parameter. The results in Section 3 give us the 
rate oik = kn that guarantees convergence of the banded estimator S^, but 
they do not offer much practical guidance for selecting k for a given dataset. 
The standard way to select a tuning parameter is to minimize the risk 

(22) R{k) = E\\tk-n\(i,i), 
with the "oracle" k given by 

(23) /cq = argmini?(A;). 

k 

The choice of matrix norm in (22) is somewhat arbitrary. In practice, we 
found the choice of k is not sensitive to the choice of norm; the li to li 
matrix norm does just slightly better than others in simulations, and is also 
faster to compute. 

We propose a resampling scheme to estimate the risk and thus k^: divide 
the original sample into two samples at random and use the sample covari- 
ance matrix of one sample as the "target" to choose the best k for the other 
sample. Let ni, n2 = n — ni be the two sample sizes for the random split, 
and let S^, Sg*^^ be the two sample covariance matrices from the z^th split, 
for v = \,. . . ,N . Alternatively, N random splits could be replaced by K-iold 
cross-validation. Then the risk (22) can be estimated by 

(24) R{k) = -Y.\\Bk{T^i^)-^t^ki,i) 

u=l 

and k is selected as 

(25) = argmini2(A;). 

k 
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Generally we found little sensitivity to the choice of ni and n2, and used ni = 
n/3 throughout this paper. If n is sufficiently large, another good choice (see, 
e.g., Bickel, Ritov and Zakai [6]) is ni = n(l — 1/logn). 

The oracle /cq provides the best choice in terms of expected loss, whereas 
k tries to adapt to the data at hand. Another, and more challenging, com- 
parison is that of k to the best band choice for the sample in question: 

(26) /ci =argmin||Sfc - 

k 

Here ki is a random quantity, and its loss is always smaller than that of k^. 
The results in Section 6 show that k generally agrees very well with both ko 
and ki, which are quite close for normal data. For heavier-tailed data, one 
would expect more variability; in that case, the agreement between k and 
ki is more important that between k and /cq. 

It may be surprising that using the sample covariance S2 as the target in 
(24) works at all, since it is known to be a very noisy estimate of S. It is, 
however, an unbiased estimate, and we found that even though (24) tends 
to overestimate the actual value of the risk, it gives very good results for 
choosing k. 

Criterion (24) can be used to select k for the Cholesky-based as well. 
An obvious modification — replacing the covariance matrices with their in- 
verses in (24) — avoids additional computational cost and instability associ- 
ated with computing inverses. One has to keep in mind, however, that while 
Efc is always well-defined, is only well-defined for k <n, since otherwise 
regressions become singular. Hence, if p > n, k can only be chosen from the 
range 0, . . . , n — 1, not 0, ... ,p — 1. 

6. Numerical results. In this section, we investigate the performance 
of the proposed banded estimator of the covariance and the resam- 
pling scheme for the choice of k, by simulation and on a real dataset. The 
Cholesky-based and its variants have been numerically investigated by 
extensive simulations in [33] and [18], and shown to outperform the sample 
covariance matrix. Because of that, we omit from simulations, and only 
include it in the real data example. 

6.1. Simulations. We start from investigating the banded estimator by 
simulating data from A^(0, Sp) with several different covariance structures 
Sp. For all simulations, we report results for n = 100 and p = 10, 100, and 
200. Qualitatively, these represent three different cases: p <^n, p n and 
p> n. We have also conducted selected simulations with p = 1000, n = 100, 
which qualitatively corresponds to the case p^ n; all the patterns observed 
with p> n remain the same, only more pronounced. The number of random 
splits used in (24) was N = 50, and the number of replications was 100. 
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? 4 6 S 5 10 15 H} 2& 30 5 10 15 20 25 30 

K k k 

Fig. 1. MA(1) covariance: True {averaged over 100 realizations) and estimated risk 
{single realization) as a function of k, plotted for k < 30. Both risks are increasing after 
k = 1 for all p. 

Example 1 {Moving average covariance structure). We take Sp to be 
the covariance of the MA(1) process, with 

aij=p\'-'j\-l{\i-j\<l}, l<i,j<p. 

The true Tip is banded, and the oracle kQ = l for all p. For this example we 
take p = 0.5. Figure 1 shows plots of the true risk R{k) and the estimated 
risk R{k) from (24). While the risk values themselves are overestimated by 
(24) due to the extra noise introduced by S2, the agreement of the minima 
is very good, and that is all that matters for selecting k. 

Table 1 shows the oracle values of /cq and ki, the estimated k, and the 
losses corresponding to all these along with the loss of the sample covariance 
E. When the true model is banded, the estimation procedure always picks 
the right banding parameter k = 1, and performs exactly as well as the 
oracle. The covariance matrix, as expected, does worse. 

Example 2 (Autoregressive covariance structure). Let Sp be the covari- 
ance of an AR(1) process. 



Table 1 

MA(1): Oracle and estimated k and the corresponding loss values 



p 


feo 




Mean (SD) 








Loss 




fei 


k 


k\ — k 




Sfeo 




E 


10 


1 


1 (0) 


1 (0) 


0(0) 


0.5 


0.5 


0.5 


1.2 


100 


1 


1 (0) 


1 (0) 


0(0) 


0.8 


0.8 


0.8 


10.6 


200 


1 


1 (0) 


1 (0) 


0(0) 


0.9 


0.9 


0.9 


20.6 
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For this simulation example, we take p= 0.1, 0.5 and 0.9. The covariance 
matrix is not sparse, but the entries decay exponentially as one moves away 
from the diagonal. Results in Figure 2 and Table 2 show that the smaller 
p is, the smaller the optimal k. Results in Table 2 also show the variability 
in k increases when the truth is far from banded (larger p), which can be 
expected from the flat risk curves in Figure 2. Variability of ki increases as 
well, and ki—k\s not significantly different from 0. In terms of the loss, the 
estimate again comes very close to the oracle. 

Example 3 {Long-range dependence). This example is designed to chal- 
lenge the banded estimator, since conditions (5) will not hold for covariance 
matrix of a process exhibiting long-range dependence. Fractional Gaussian 
noise (FGN), the increment process of fractional Brownian motion, provides 
a classic example of such a process. The covariance matrix is given by 

^ij = - j\ + 1)'^ - - jf ^ + - 1)'^], 1 < i,J<P, 

where G [0.5, 1] is the Hurst parameter. H = 0.5 corresponds to white 
noise, and the larger H, the more long-range dependence in the process. 
Values of H up to 0.9 are common in practice, for example, in modeling 
Internet network traffic. For simulating this process, we used the circulant 
matrix embedding method [4], which is numerically stable for large p. 

Results in Table 3 show that the procedure based on the estimated risk 
correctly selects a large k {k ^p) when the covariance matrix contains strong 
long-range dependence {H = 0.9). In this case banding cannot help — but it 
does not hurt, either, since the selection procedure essentially chooses to do 
no banding. For smaller H, the procedure adapts correctly and selects k = 
for H = 0.5 (diagonal estimator for white noise), and a small k for H = 0.7. 

Another interesting question is how the optimal choice of k depends on 
dimension p. Figure 3 shows the ratio of optimal k to p, for both oracle ko 
and estimated k, for AR(1) and FGN [for MA(1), the optimal k is always 1]. 
The plots conffi'm the intuition that (a) the optimal amount of regularization 
depends on S, and the faster the off-diagonal entries decay, the smaller the 
optimal k; and (b) the same model requires relatively more regularization 
in higher dimensions. 

6.2. Call center data. Here we apply the banded estimators and 
to the call center data used as an example of a large covariance estima- 
tion problem by [18], who also provide a detailed description of the data. 
Briefly, the data consist of call records from a call center of a major U.S. 
financial institution. Phone calls were recorded from 7:00 am until midnight 
every day in 2002, and weekends, holidays and days when equipment was 
malfunctioning have been eliminated, leaving a total of 239 days. On each 
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Fig. 2. AK(1) covariance: True {averaged over 100 realizations) and estimated risk [sin- 
gle realization) as a function of k. 



Table 2 

AR(1): Oracle and estimated k and the corresponding loss values 



Mean (SD) Loss 



p 


P 


ko 


fei 


k 


fci — fe 




Sfeo 




E 


10 


0.1 


1 


0.5 (0.5) 


0.0 (0.2) 


0.5 (0.6) 


0.5 


0.5 


0.4 


1.1 


10 


0.5 


3 


3.3 (0.8) 


2.0 (0.6) 


1.3 (1.1) 


1.1 


1.0 


1.0 


1.3 


10 


0.9 


9 


8.6 (0.7) 


8.9 (0.3) 


-0.4 (0.7) 


1.5 


1.5 


1.5 


1.5 


100 


0.1 





0.2 (0.4) 


0.1 (0.3) 


0.1 (0.6) 


0.6 


0.6 


0.6 


10.2 


100 


0.5 


3 


2.7 (0.7) 


2.3 (0.5) 


0.4 (1.0) 


1.6 


1.6 


1.5 


10.6 


100 


0.9 


20 


21.3 (4.5) 


15.9 (2.6) 


5.5 (5.8) 


9.2 


8.8 


8.5 


13.5 


200 


0.1 


1 


0.2 (0.4) 


0.2 (0.4) 


-0.0 (0.6) 


0.7 


0.6 


0.6 


20.4 


200 


0.5 


3 


2.4 (0.7) 


2.7 (0.5) 


-0.2 (1.0) 


1.8 


1.7 


1.7 


20.8 


200 


0.9 


20 


20.2 (4.5) 


16.6 (2.4) 


3.6 (5.6) 


9.9 


9.7 


9.5 


24.5 
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Table 3 

FGN: Oracle and estimated k and the corresponding loss values 



p 


H 


feo 




Mean (SD) 








Loss 




fei 


k 


fei — fe 




Sfco 




E 


10 


0.5 





0.0 (0.0) 


0.0 (0.0) 


0.0 (0.0) 


0.3 


0.3 


0.3 


1.1 


10 


0.7 


5 


5.0 (1.8) 


2.3 (1.5) 


2.7 (2.5) 


1.4 


1.2 


1.1 


1.2 


10 


0.9 


9 


8.6 (0.6) 


9.0 (0.1) 


-0.4 (0.6) 


1.5 


1.5 


1.5 


1.5 


100 


0.5 





0.0 (0.0) 


0.0 (0.0) 


0.0 (0.0) 


0.4 


0.4 


0.4 


10.2 


100 


0.7 


4 


4.9 (2.2) 


4.1 (1.6) 


0.8 (2.9) 


5.5 


5.5 


5.4 


10.7 


100 


0.9 


99 


82.1 (10.9) 


85.1 (15.5) 


-3.1 (19.0) 


17.6 


16.6 


16.6 


16.6 


200 


0.5 





0.0 (0.0) 


0.0 (0.1) 


-0.0 (0.1) 


0.4 


0.4 


0.4 


20.1 


200 


0.7 


3 


4.2 (2.2) 


4.9 (2.1) 


-0.7 (3.4) 


7.9 


7.7 


7.7 


20.9 


200 


0.9 


199 


164.0 (22.7) 


139.7 (38.9) 


24.3 (47.4) 


37.8 


33.3 


33.3 


33.3 




day, the 17-hour recording period was divided into 10-minute intervals, and 
the number of cahs in each period, Nij, was recorded for each of the days 
i = 1, . . . , 239 and time periods j = 1,. . . , 102. A standard transformation 
Xij = (Nij + 1/4)^/^ was apphed to make the data closer to normal. 

The goal is to predict arrival counts in the second half of the day from 
counts in the first half of the day. Let Xj = (x^-^^ , x^-^^ ) , with x|^^ = {xn , . . . , rrj^si ) , 

and x) = (xi^52, . . . , Xi^io2)- The mean and the variance of Vx are parti- 
tioned accordingly. 
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Time point 

Fig. 4. Call center forecast error using the sample covariance E and the best 
Cholesky-based estimator Efc, k= 19. 

The best linear predictor of x^^^ from x^^^ is then given by 
(28) xf =^2 + S2iSr/(xr^-w). 

Different estimators of S in (27) can be plugged into (28). To compare their 
performance, the data were divided into a training set (January to October, 
205 days) and a test set (November and December, 34 days). For each time 
interval j, the performance is measured by the average absolute forecast 
error 

239 
j=206 

The selection procedure for k described in Section 5 to both and S^. It 
turns out that the data exhibit strong long-range dependence, and for the 
selection procedure picks k = p= 102, so banding the covariance matrix is 
not beneficial here. For S^, the selected k = 19 produces a better prediction 
for almost every time point than the sample covariance S (see Figure 4). 

This example suggests that a reasonable strategy for choosing between 
and Sfc in practice is to estimate the optimal k for both and use the one that 
selects a smaller k. The two estimators are meant to exploit different kinds 
of sparsity in the data, and a smaller k selected for one of them indicates 
that that particular kind of sparsity is a better fit to the data. 

7. Discussion. I. If a*-' = 0, \i — j\ > k and < £q^, then X is a 

A;th-order autoregressive process and as we might expect, S^^p is the right 
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estimate. Now suppose a** < Eq^ for all i and we only know that a^^ = for 
each i and p— {2k + 1) j's. This condition may be interpreted as saying that, 
for each i there is a set Si with \Si\ < k, i ^ Sq, such that, Xi is independent of 
{Xt,t ^ Si,t^ i} given {Xj : j G Si}. Although banding would not in general 
give us sparse estimates, the following seem intuitively plausible: 

(1) Minimize a suitable objective function ^'(P, S) > where P is the em- 
pirical distribution of Xi , . . . , X„ and 

*(^,Sp) = 

subject to II nil (1,1) < 7n,p- 

(2) Let jn,p — > "slowly." This approach should yield estimates which con- 
sistently estimate sparse covariance structure. Banerjee, D'Aspremont 
and El Ghaoui [2] and Huang et al. [18] both use normal or Wishart- 
based log-likelihoods for ^ and a Lasso- type penalty in this context. We 
are currently pursuing this approach more systematically. 

IL The connections with graphical models are also apparent. If D is the 
dependency matrix of S"^, with entries and 1, then ||D||(i i) is just the 
maximum degree of the graph vertices. See Meinshausen and Buhlmann [27] 
for a related approach in determining covariance structure in this context. 

in. A similar interpretation can be attached if we assume S is ko banded 
after a permutation of the rows. This is equivalent to assuming that there is 
a permutation of variables after which Xi is independent of {Xj -.{j — i\ > k} 
for all i. 

IV. In the case when S is the covariance of a stationary process, there is a 
possibility of regularizing the covariance function via inverting a regularized 
estimate of the spectral density. There is a large literature on spectral den- 
sity estimate (see, e.g., [11] for an approach based on local smoothing and 
a review). Exploring this connection further, and in particular, understand- 
ing the equivalent of banding in the spectral domain, is a topic for future 
work. 

APPENDIX: ADDITIONAL LEMMAS AND PROOFS 

In addition to the operator norm ||M|| from I2 to I2 we have defined, in 
what follows we use additional matrix norms. For a vector x = (xi, . . . , Xp)'^ , 
let 

p 

1 1 X 1 1 ]^ ~ ^ ^ I 1 5 1 1 1 1 00 ~ ihSjX I Xj I . 
i=i ^ 
For a matrix M = [mij], the corresponding operator norms from li to li and 
^00 to loo are, respectively, 

11^11(1,1) = sup{||Mx||i : ||x||i = 1} = max^ l^ijl, 
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(Al) 

ll^ll(oo,oo) = sup{||Mx||oo : ||x||oo = 1} =max2j \mij\. 

j 

We will also write ||M||oo = maxjj \mij\. 

For symmetric matrices, ||M||(i = ||M||(cxd oq). The h to li norm arises 
naturally through the inequality (see, e.g., Golub and Van Loan [15]) 

(A2) ||M|| < [||M||(i,i)||M||(oo,oo)]^^^ = for M symmetric. 

Proof of Theorem 1. It is easy to see that (A2) and (A2) imply 
(A3) \\Bk{±) - Bkim = Op{k\\Bk{±) - Sfe(S)IU). 

Let Ti^ = ^ J27=i ^f'^i &iid w.l.o.g. i?Xi = 0. By an application of a re- 
sult of Saulis and Statulevicius [29] (see Lemma A. 3) and the union sum 
inequality, 

(A4) P[\\Bk{t'>)-Bk{J:)\\^>t]<{2k + l)p exp{-nt^j{eo,X)} 

for |t| < A = A(eo). By choosing t = M(i^^^^^)^/^ for M arbitrary we con- 
clude that, uniformly on U, 

\\Bk{t^)-Bk{^p)\\^ = Op{{n-hog{pk))'/^) 

(A5) 

= Op((n-i logp)i/2) 

since k <p. On the other hand, by 5, 

(A6) ||5fc(Sp)-Sp||oo<CA:-" 

for Sp G ^(eoi C). 

Combining (A5) and (A6), the result follows for i?fc(S^). But, if X = 
{Xi, . . . ,Xp)'^, 

||Sfc(S°)-Sfe(S)||<||Sfc(X^X)||<(2A; + l) max |X,f 

i<i<p 



Since 

m^t)]-' - = np{\\Bk„{t) - Spii), 

uniformly on U, the result follows. □ 

The key to Theorem 2 is the following lemma which substitutes for (A3). 
Consider symmetric matrices M indexed by (a, 6), a,b€A, a finite index 
set. Suppose for each a £ A there exist Na < N sets Saj such that the Saj 
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form a partition of ^ — {a}. Define, for any 1 < j < A^, M = [m{a,b)] as 
above: 

r{j) = max{|m(a, b)\: b £ Saj, a £ A} 
and /i = maX(i |m(a, a)| . 

Lemma A.l. Under Assumption A, 

N 

(A7) ||M||</i + 5]r(j). 

i=i 

Proof. Apply (A2) noting that 

N 

for all a£A. □ 

Proof of Corollary 2. An examination of the proof of Theorem 1 
will show that the bound of ||5]p — was used solely to bound 

— But in the case of Corollary 2, a theorem of Kolmogorov 

(De Vore and Lorentz [8], page 334) has, after the identification (Al), 

(A8) ||s,-i?,(S,)||<^^ 

where C depends on C and m only, for all T,p G C{£o,m,C). The result 
follows. Note that Corollary 1 would give the same results as the inferior 
bound C"fc-(™-i). □ 

To prove Theorem 3 we will need: 

Lemma A. 2. Under conditions of Theorem. 3, uniformly on lA, 
(A9) max{||af^ -af^ll^: l< j <p] = Op{n-^/^\og^/'^ p), 
(AlO) max{|d2^ - : 1 < j < p} = Op((n-i logp)°/(2(«+i))) 
and 

(All) Pfc|| = ||Z)^-i|| = 0(l), 

where a!"^ = (a^-^\ . . . , flj*j-_i) are the empirical estimates of the vectors = 
{a^'i , ■ ■ ■ , "j'^j'-i) and (ij 1 < j <p are the empirical estimates of the ^. 
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To prove Lemma A. 2 we need an additional lemma, obtained from results 
of Saulis and Statulevicius [29]. 

Lemma A. 3. Let Zi be i.i.d. AA(0,Sp) and Amax(Sp) < Cq ^ < oo. Then, 
ifT,p = [aab], 



(A12) 



'jk, 



i=l 



> nv 



< Ci exp(— C2nz^^) for \ v\ <6, 
where Ci, C2 and 5 depend on Eq only. 



Proof. Write 



P 



'^{ZijZik — a 



'jk, 



i=l 



> nu 



P 



> 



nu 



{crjjCrkkY/'^ 



i=l 

where pjk = (Tjk{(Tjjakk)~^^'^ and (Z*-, Z*^,) ~ A/'2(0, 0, 1, l,Pjk)- Now, 

n 

^{ZtjZ*k - Pjk) 



i=l 



(A13) 



$:[(z*.+z4)2-2(i+p,,)] 



.1=1 



+ ^[(Z*.-Z4)2-2(l-p,-fc)] 



1=1 



and reduce the problem to estimating 



2P 



i=l 



> 



nv 



2(1 - Pjk){crjjcrkkY^'^ 



where Vi are i.i.d. AA(0, 1). Since Xi satisfies condition (P) (3.12) on page 
45 of [29], the lemma follows from Theorem 3.2, page 45 and (2.13) on page 
19, since {crjjakkY/'^\l - pjk\ < 2^0 ^ □ 

Proof of Lemma A. 2. Note first that 
(A14) llVarX- VarXlloo = Op(n-^/2iogi/2^)^ 

by Lemma A. 3. Hence, 

(A15) max||Var"\zf)) - Var-i(zf))||^ = Op(n-V2iQgi/2^)^ 
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To see this, note that the entries of VarX — VarX can be bounded by 
n~^\J27=i^iaXib - (Jabl + n~'^\J2i^iXia\\J2i=iXib\, where w.l.o.g. we as- 
sume i?X = 0. Lemma A. 3 ensures that 



max 

a,b 



n 



^ ^^{^ia^ib — O'ab) 



i=l 



> V 



<C7ip2exp(-C2nz^2) 



for \v\ < 6. Now take u = Y^'^M for M arbitrary. The second term is 
similarly bounded. 
Also, 

\\^~'\\ = \\{Y^rXr'\\<e^\ 
Claim (A9) and the first part of (All) follow from 3, (A14) and (A15). Since 

/ i-i 



d 



,., = VarX,-Var( ^ dfj Xt 

\t=j-k 



d2^ = VarXj -Var ^ a)/ Xt 



(A16) 



and the covariance operator is linear, 

\d]k - ^ I Var(Xj) - VaxXjl 

t=j~k 

i-i i-i 
Var E 4?^*-^^^ E 4?^* 

t=j—k t=j—k 

The sum J2tZj-k understood to be Sf=max{i j-fc)' "^^^ maximum over j 
of the first term is Op(n~^/2 log^^^p) by Lemma A. 3. The second can be 
written as 

I E {(4? - 4? )(4? - a'ff)C^HXs,X,):j -k<s,t<j-l} 



< E |a}r-«}riVar--(X,, 



(A17) 



\t=j-k 



<k max(a^-^ — a^-^') maxVar(Xt) 



Op{k\-\logpf) = Op((n-i logp)"/(2("+i))) 
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by (A9) and < e^^ . Note that in the last equahty we used the assump- 
tion n~^logp = o{l). The third term in (A16) is bounded similarly. Thus 
(AlO) follows. Further, for I <j <p, 

d% = Var (Xj - {(^fXt : max(l, j -k)<t<j-l}] 

(A18) 

>^o(l + E(4?)')>^o 
and the lemma follows. □ 

Proof of Theorem 3. We parallel the proof of Theorem 1. We need 
only check that 

(A19) JIU = Op(n-V2iogi/2p) 

and 

(A20) ||S,]-i?fc(S-i)||=0(fc-°). 
We first prove (A19). By definition, 

k,p k,p 

(A21) 

= (/ - Ak)D^\l - Akf -{I- Ak)Dl\l - Akf 

where A]^, are the empirical versions of A^ and D^. Apply the standard 
inequality 

P(l)^(2)^(3)_ ^(1)^(2)^(3) II 
3 

(A22) <Ep(j') n WB^^'^W 

j=l k^j 

+ E 11^^'^ II n p^''^ - ^^'^11 + n p^'^ - ^^'^11- 
j=i k^j j=i 

Take = [^(3)]T = i-Ak, = [^(3)]^ = j - Ak, A^^^ = \ ^(2) = 
D~^^ in (A22) and (A19) follows from Lemma A. 2. For (A20), we need only 
note that for any matrix M, 

\\MM'^ -Bk{M)Bk{M^)\\ 

< 2||M||||Bfc(M) -M|| + \\BkiM)-Mf 

and (A20) and the theorem follows from our definition of U^^. □ 

Lemma A. 4. Suppose S = [p{j — i)] is a Toeplitz covariance matrix; 
p{k) = p{—k) for all k, T, £ C{eo,m,C). Then, if f is the spectral density of 
E; 
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(i) J:~' = [p{j-z)], p{k)=p{-k), 

(ii) S ^ has spectral density j, 

(iii) S"iG£(eo,m,C'(m,eo,C)). 

Proof. That UjY'^^Woo < C'{m,eo,C) and eo < ||j||oo < IS imme- 
diate. The claims (i) and (ii) fohow from the identity, j = J2T=-oo Pik)e^'^'^'' 
in the L2 sense and 



1= E '5ofce2'^'^" = /(n)-(n). 

k=—oo •' 

Proof of Corollary 2. Note that S g C{eo,m,Co) imphes that 

00 

(A23) /e ^''^('") =ao + ^ak cos(27rj? 



□ 



is itself m times differentiable and 



(A24) 
But then, 



(A25) 



/s-i =bo + Y^ bj cos 27rjn 
i=i 

(00 
ao + E aj cos 27rji 



where bi = X]j=o^j^i-i- formal operations are justified since J2'j^o \ ^j\ ^ 
00 follows from Zygmund [35], page 138. But (A25) can be reinterpreted in 
view of Lemma A. 4 as = AA^ where A = [ai-jl{i > j)] and aj are real 
and given by (A25). Then, if A^ = B^iA), Bk[A)Bl{A) has spectral density, 



(A26) 



/ k .2 
(") = «i cos 2TTju . 
\i=o ) 



Moreover, from (A25) and (A26) 



< 



00 



aj COS 27r jn 



11/^ 



-1/2, 



+ 



Oj cos 27rjti 



By (A24) lojl < Cj"™, hence finally. 



(A27) 



ll/y 



/s-i||oo<CA:-(™-i). 
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Corollary 2 now follows from (A27) and (A9) and (AlO) by minimizing 

Proof of Theorem 5. We employ a famous formula of Kato [23] and 
Sz.-Nagy [31]. If R{X,M) = {M - \J)-^ for A G S", the resolvent set of M 
and Ao is an isolated eigenvalue, [A — Ao| > A for all A G 5, A 7^ Aq, then 
(formula (1.16), page 67, Kato [24]) 

(A28) Eo{x) = ^ [ R{X,M)dX, 

where Eq is the projection operator on the eigenspace corresponding to Aq 
and r is a closed simple contour in the complex plane about Aq containing 
no other member of S. The formula is valid not just for symmetric M but 
we only employ it there. We argue by induction on m. For m = 1, |Ai(M) — 
Ai(A^)| < ||Af — A^ll for M, N symmetric by the Courant-Fischer theorem. 
Thus, if [[Sfc^p — Spll < Y (say), we can find F containing Ai(Efc^p) and Ai(Ep) 
and no other eigenvalues of either matrix with all points on F at distance 
at least A/4 from both Xi{T,k,p) and Ai(Sp). Applying (A28) we conclude 
that 

ll^i-^ill <max{||i?(A,Sp)||||i2(A,Sfc,p)||}||Sfc,p-S||. 

But, ||i?(A,Sp)|| < {maxj \X — Xj(T,p)\}~^ < 2/A by hypothesis, and similarly 
for ||i2(A,Efc^p)||. Therefore, 



(A29) < 16A"^||Sfe,p-S||, 

and the claims (19) and (20) are established for m = 1. We describe the 
induction step from m = \ to m = 2 which is repeated with slightly more 
cumbersome notation for all m (omitted). Consider a unit vector, 

p 

(A30) x = Y^ EjX _L Eix = {Ei - Ei)x + ( J - Ei)x. 

i=2 



Then, 
(A31) 
Therefore, 



\{x,^k,px) - {{J - Ei)T,k^p{J - Ei)x,x)\ 

< ||Sfc,p||(2||i;i-^i|| + ||^i-^i||2). 

A2(Sfc,p) = may:{{x, {J - Ei)T,k,p{J - Ei)x) :\x\ = 1} 
<0(||ii-Si||) + A2(Sp). 
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Inverting the roles of '^k,p and Sp, we obtain 

|A2(Sfc,p) - A2($]p)| = Op{\\tk,p - Spll). 
Now repeating the argument we gave for (A29), we obtain 

(A32) \\E2-E2\\=Op{\\tk,p-^p\\). 

The theorem follows from the induction and Theorem 1. □ 

Note that if we track the effect of A and m, we in fact have 
\\Ej - EjW = Op(jA-2||s^^p _ Spll), 1 < i < m. 
Also note that the dimension of Yl^=i immaterial. 
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